Dataset: Financial Contributions to Presidential Campaigns (Ohio State)
Time: 2016
Reason for choosing this dataset:
Ohio is known as a swing state, so its result is often treated as a bellwether for the national election outcome.
Basic information about the dataset:
## the total number of row in oh_data: 164475
## [1] "cmte_id" "cand_id" "cand_nm"
## [4] "contbr_nm" "contbr_city" "contbr_st"
## [7] "contbr_zip" "contbr_employer" "contbr_occupation"
## [10] "contb_receipt_amt" "contb_receipt_dt" "receipt_desc"
## [13] "memo_cd" "memo_text" "form_tp"
## [16] "file_num" "tran_id" "election_tp"
## [19] "party" "Month_Yr" "Day_Month"
## [22] "weekday" "surname" "gender"
## cmte_id cand_id cand_nm
## C00575795:71194 P00003392:71194 Clinton, Hillary Rodham :71194
## C00577130:34686 P60007168:34686 Sanders, Bernard :34686
## C00580100:24166 P80001571:24166 Trump, Donald J. :24166
## C00574624:16406 P60006111:16406 Cruz, Rafael Edward 'Ted':16406
## C00573519: 7937 P60005915: 7937 Carson, Benjamin S. : 7937
## C00581876: 4824 P60003670: 4824 Kasich, John R. : 4824
## (Other) : 5262 (Other) : 5262 (Other) : 5262
## contbr_nm contbr_city contbr_st
## STOWE, JANICE : 277 COLUMBUS : 17328 OH:164475
## MISSLER, ANDREW J. MR.: 203 CINCINNATI: 15630
## BRIONES, BERTA : 179 CLEVELAND : 5778
## MOESER, DONALD : 176 DAYTON : 4634
## CUMMINGS, JOHN : 142 TOLEDO : 3287
## SCHEEL, PATRICK : 133 AKRON : 3206
## (Other) :163365 (Other) :114612
## contbr_zip contbr_employer
## Min. : 10 RETIRED :27097
## 1st Qu.:431109498 N/A :22434
## Median :440942900 SELF-EMPLOYED : 8353
## Mean :368573923 NONE : 7638
## 3rd Qu.:450131451 INFORMATION REQUESTED: 7611
## Max. :458969665 (Other) :91213
## NA's :3 NA's : 129
## contbr_occupation contb_receipt_amt contb_receipt_dt
## RETIRED :43434 Min. :-10800 Min. :2014-07-17
## NOT EMPLOYED :10378 1st Qu.: 16 1st Qu.:2016-02-29
## INFORMATION REQUESTED: 7549 Median : 28 Median :2016-05-31
## ATTORNEY : 3320 Mean : 120 Mean :2016-05-16
## HOMEMAKER : 3234 3rd Qu.: 80 3rd Qu.:2016-08-25
## (Other) :96538 Max. : 29100 Max. :2016-11-28
## NA's : 22
## receipt_desc memo_cd
## :162495 :127925
## Refund : 887 X: 36550
## REDESIGNATION FROM PRIMARY: 211
## REDESIGNATION TO GENERAL : 210
## REATTRIBUTION TO SPOUSE : 114
## REATTRIBUTION FROM SPOUSE : 112
## (Other) : 446
## memo_text form_tp
## :114599 SA17A:128232
## * EARMARKED CONTRIBUTION: SEE BELOW: 33677 SA18 : 35356
## * HILLARY VICTORY FUND : 14385 SB28A: 887
## EARMARKED FROM MAKE DC LISTEN : 282
## *BEST EFFORTS UPDATE : 246
## REDESIGNATION FROM PRIMARY : 211
## (Other) : 1075
## file_num tran_id election_tp
## Min. :1003942 A80E77D0E713E417AA88: 3 : 522
## 1st Qu.:1077664 C11887628 : 3 G2016: 56271
## Median :1096260 C10225661 : 2 P2016:107682
## Mean :1095976 C10228611 : 2
## 3rd Qu.:1119042 C10230213 : 2
## Max. :1134173 C10234145 : 2
## (Other) :164461
## party Month_Yr Day_Month weekday
## Length:164475 2016-10:18582 Min. : 1.00 Monday :26927
## Class :character 2016-07:18208 1st Qu.: 8.00 Tuesday :29339
## Mode :character 2016-03:16599 Median :15.00 Wednesday:29176
## 2016-08:14777 Mean :16.04 Thursday :23544
## 2016-04:14059 3rd Qu.:25.00 Friday :24160
## 2016-02:13335 Max. :31.00 Saturday :16619
## (Other):68915 Sunday :14710
## surname gender.gender
## Length:164475 Length:164475
## Class :character Class :character
## Mode :character Mode :character
##
##
##
##
From this output and the variable definitions, I can identify the type of each variable and decide on the next exploration steps.
oh_data has 164,475 rows. After enrichment, there are 24 variables.
The key questions I would like to answer with this dataset are:
1) Is there any correlation between the contributed amount and the voting result?
2) Are there any patterns in who donates, e.g. by occupation, gender or city of residence?
Key variable: donation amount (contb_receipt_amt, numeric variable)
Other numeric variables for exploring distribution: N/A
Important non-numeric variables: candidate name (cand_nm), gender (gender), occupation (contbr_occupation), city (contbr_city), party (party)
The distribution is widely spread, and there are some negative amounts due to refunds. To get a better view of the donation amounts, I transformed the plot with a base-10 logarithm. On the log scale, the most common donation amounts fall roughly between US$50 (10^1.7) and US$74 (10^1.87).
## Donated amount range: -10800 29100
I am using log base 10 for monetary amounts because powers of ten are a natural scale for money: $100, $1,000, $10,000, and so on. The transformed data is easy to read.
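As a minimal sketch of the transformation described above: the vector `amounts` is made-up illustrative data, not the real contb_receipt_amt column. Refunds and zeros have to be set aside first, because log10() is undefined for non-positive numbers.

```r
# Hypothetical donation amounts, including one refund (negative value).
amounts <- c(-50, 5, 16, 28, 50, 75, 80, 100, 2700)

# log10() is only defined for positive numbers, so drop refunds and zeros.
positive <- amounts[amounts > 0]
log_amounts <- log10(positive)

# Back-transform two points on the log scale to dollars.
back_to_dollars <- 10^c(1.7, 1.87)
print(round(back_to_dollars, 1))

# Bin the values on the log scale; each unit on this axis is a factor of ten.
# Remove plot = FALSE to draw the histogram instead of only computing it.
h <- hist(log_amounts, breaks = 10, plot = FALSE)
print(h$counts)
```

Reading a bin edge x on this scale as 10^x dollars is how the dollar ranges quoted in the text are obtained.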
## 24 unique candidates
## 6555 unique occupations
## 1341 unique contributed cities
After plotting bar charts and counting the unique values of these non-numeric variables, it is clear that occupations and cities have too many distinct values. A graph of all occupations or cities would be unreadable, so I plotted the top 15 occupations and the top 10 cities by contributed amount.
There are only 24 unique candidates, so I used a horizontal bar chart to show each candidate's full name. According to the bar chart, Hillary Clinton, Bernard Sanders and Donald Trump received the largest contributed amounts in Ohio.
As for occupations, retired people contributed the most in Ohio. By city, Columbus, Cincinnati and Cleveland are the top 3 contributors to the election campaigns.
## cmte_id cand_id cand_nm contbr_nm
## 3 C00575795 P00003392 Clinton, Hillary Rodham HILSON, ANN
## 4 C00577130 P60007168 Sanders, Bernard HARTFORD, MARK
## 6 C00577130 P60007168 Sanders, Bernard MARTINEZ, JORGE
## 9 C00574624 P60006111 Cruz, Rafael Edward 'Ted' KERR, JAMES L. MR.
## 11 C00575795 P00003392 Clinton, Hillary Rodham NORMA, CORRADO
## 12 C00575795 P00003392 Clinton, Hillary Rodham BLOUNT, ROBERT
## contbr_city contbr_st contbr_zip contbr_employer
## 3 COLUMBUS OH 432141210 INFORMATION REQUESTED
## 4 COLUMBUS OH 432022420 NOT EMPLOYED
## 6 CINCINNATI OH 45249 SELF
## 9 COLUMBUS OH 432133419 RETIRED
## 11 COLUMBUS OH 432242065 N/A
## 12 COLUMBUS OH 432075166 INFORMATION REQUESTED
## contbr_occupation contb_receipt_amt contb_receipt_dt receipt_desc
## 3 INFORMATION REQUESTED 40.0 2016-04-12
## 4 NOT EMPLOYED 50.0 2016-03-06
## 6 ATTORNEY AT LAW 2.5 2016-03-05
## 9 RETIRED 70.0 2016-04-01
## 11 RETIRED 100.0 2016-03-31
## 12 INFORMATION REQUESTED 100.0 2016-04-13
## memo_cd memo_text form_tp file_num
## 3 X * HILLARY VICTORY FUND SA18 1091718
## 4 * EARMARKED CONTRIBUTION: SEE BELOW SA17A 1077404
## 6 * EARMARKED CONTRIBUTION: SEE BELOW SA17A 1077404
## 9 SA17A 1077664
## 11 X *BEST EFFORTS UPDATE SA17A 1091718
## 12 X * HILLARY VICTORY FUND SA18 1091718
## tran_id election_tp party Month_Yr Day_Month weekday
## 3 C4715778 P2016 Democratic 2016-04 12 Tuesday
## 4 VPF7BKYX9P5 P2016 Other 2016-03 6 Sunday
## 6 VPF7BKX86M2 P2016 Other 2016-03 5 Saturday
## 9 SA17A.1533358 P2016 Republican 2016-04 1 Friday
## 11 C3925076 P2016 Democratic 2016-03 31 Thursday
## 12 C4721952 P2016 Democratic 2016-04 13 Wednesday
## surname gender
## 3 ANN female
## 4 MARK male
## 6 JORGE male
## 9 JAMES male
## 11 CORRADO male
## 12 ROBERT male
The election was held in November 2016, but donations started in March 2015 and reached a first peak in March 2016. Donations reached their highest peak in October 2016.
Although the Democratic party has more donation records, the Republican party received a larger total amount, which means the average Republican donation is larger.
The gender split is almost equal (female : male is roughly 1 : 1).
I listed the top 10 cities by number of donation records and by donation amount. Columbus, for example, has the most donation records of any city but does not rank first by amount. This suggests that some cities have a larger share of relatively small donations.
## # A tibble: 1,341 × 2
## contbr_city total_amount
## <fctr> <dbl>
## 1 CINCINNATI 2605688.7
## 2 COLUMBUS 2226563.1
## 3 CLEVELAND 866239.9
## 4 CHAGRIN FALLS 383091.9
## 5 DUBLIN 379636.9
## 6 SHAKER HEIGHTS 376150.9
## 7 AKRON 358729.6
## 8 DAYTON 353846.1
## 9 CANTON 277801.2
## 10 WESTERVILLE 254291.7
## # ... with 1,331 more rows
## # A tibble: 10 × 2
## contbr_city total_amount
## <fctr> <dbl>
## 1 CINCINNATI 2605688.7
## 2 COLUMBUS 2226563.1
## 3 CLEVELAND 866239.9
## 4 CHAGRIN FALLS 383091.9
## 5 DUBLIN 379636.9
## 6 SHAKER HEIGHTS 376150.9
## 7 AKRON 358729.6
## 8 DAYTON 353846.1
## 9 CANTON 277801.2
## 10 WESTERVILLE 254291.7
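The comparison between record counts and total amounts per city can be sketched in base R as follows. The data frame `donations` is a small made-up sample, and the original analysis appears to have used tibbles, so this shows the idea without reproducing the author's code.

```r
# Hypothetical donation records (city, amount); not the real data.
donations <- data.frame(
  contbr_city = c("COLUMBUS", "COLUMBUS", "COLUMBUS",
                  "CINCINNATI", "CINCINNATI", "CLEVELAND"),
  contb_receipt_amt = c(25, 40, 50, 500, 1000, 100),
  stringsAsFactors = FALSE
)

# Number of donation records per city.
n_records <- aggregate(contb_receipt_amt ~ contbr_city, data = donations, FUN = length)
names(n_records)[2] <- "count"

# Total donated amount per city.
totals <- aggregate(contb_receipt_amt ~ contbr_city, data = donations, FUN = sum)
names(totals)[2] <- "total_amount"

# Combine both summaries and rank by total amount, largest first.
by_city <- merge(n_records, totals, by = "contbr_city")
by_city <- by_city[order(-by_city$total_amount), ]
print(head(by_city, 10))
```

In this toy sample the city with the most records is not the city with the largest total, which is the same pattern described for Columbus.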
There are 164,475 observations in the Ohio dataset with 18 original variables. For analysis purposes, I added 6 extra variables (party, Month_Yr, weekday, Day_Month, surname and gender).
The main feature of interest in the dataset is "contb_receipt_amt", together with the factors that influence it. I'd like to find out which features have the most impact on the contributed amount and to offer a few suggestions to future candidates running an election fundraising campaign. I suspect city, occupation and day of week matter.
Since the result of the 2016 American presidential election is already known, it is worth comparing the contributed amounts with the final voting result. I downloaded the voting result data to analyze the correlation between contributed amounts and votes in Ohio. (The analysis is covered in the next section.)
Yes, I created 6 variables for further analysis. They are listed below.
1) party: I grouped the data into 3 categories (Democratic, Republican, Other) based on candidate name
2) Month_Yr: shows the contributed amount trend by month
3) weekday: used to check whether weekdays and weekends differ substantially
4) Day_Month: the day of the month
5) surname: used to predict gender with the gender library
6) gender: the predicted gender of the contributor
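A minimal sketch of how the date-derived variables and the party label could be built in base R. The three rows are copied from the sample output in this report; `party_lookup` is a deliberately partial, hypothetical lookup table, and `is_weekend` is an extra illustrative column that is not part of the original dataset.

```r
# Three rows taken from the sample output shown in this report.
df <- data.frame(
  cand_nm = c("Clinton, Hillary Rodham", "Sanders, Bernard",
              "Cruz, Rafael Edward 'Ted'"),
  contb_receipt_dt = as.Date(c("2016-04-12", "2016-03-06", "2016-04-01")),
  stringsAsFactors = FALSE
)

# Partial lookup (assumption: the real mapping covers all 24 candidates).
party_lookup <- c(
  "Clinton, Hillary Rodham"   = "Democratic",
  "Cruz, Rafael Edward 'Ted'" = "Republican"
)
df$party <- unname(party_lookup[df$cand_nm])
df$party[is.na(df$party)] <- "Other"  # candidates missing from the lookup

# Date-derived variables.
df$Month_Yr  <- format(df$contb_receipt_dt, "%Y-%m")           # year-month label
df$Day_Month <- as.integer(format(df$contb_receipt_dt, "%d"))  # day of the month
df$weekday   <- weekdays(df$contb_receipt_dt)                  # names depend on locale

# Illustrative extra: ISO weekday number, where 6 = Saturday and 7 = Sunday.
df$is_weekend <- format(df$contb_receipt_dt, "%u") %in% c("6", "7")
print(df)
```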
I enriched the Ohio dataset with zip code data to visualize the contributed amounts on a map of Ohio. (The analysis is conducted in the multivariate plots section.)
After merging with the Ohio zip code data from the zipcode library, I found 83 records with potentially wrong zip codes, so I excluded them when plotting the contributed amounts on the map. I excluded them because it is hard to identify the correct zip code from the city name alone.
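One way to screen for suspicious zip codes before a merge is sketched below. This is not the merge-based check described above: it is a simpler format test, and it assumes that Ohio zip codes are five digits beginning with 43, 44 or 45. The input values mimic the mixed formats visible in the summary output (9-digit ZIP+4, 5-digit codes and very small numbers).

```r
# Hypothetical zip values: ZIP+4 stored as 9 digits, a 5-digit code,
# an obviously invalid value, and a missing value.
zips <- c(432141210, 45249, 10, NA)

zip_chr <- as.character(zips)

# Keep the first five digits of 9-digit ZIP+4 values; leave the rest as is.
zip5 <- ifelse(nchar(zip_chr) == 9, substr(zip_chr, 1, 5), zip_chr)

# Assumption: valid Ohio codes are five digits starting with 43, 44 or 45.
looks_ohio <- !is.na(zip5) & grepl("^4[345][0-9]{3}$", zip5)

print(data.frame(zips, zip5, looks_ohio))
```

Rows where `looks_ohio` is FALSE would be candidates for exclusion or manual review.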
Since the 2016 election result is already known, I enriched the original political finance dataset with vote data that I found online (link). Because I only analyze Ohio, I used subset to extract the Ohio rows from the national vote data and matched them to my contribution data by city name (the corresponding column in the vote data is called "county_name").
The names are formatted differently in the two data frames. Each name in the vote data ends with the word "county", so I had to trim it. The letter case also differs between the two data frames, so I converted the names to lower case with the "tolower" function before matching.
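The name clean-up and matching described above could look roughly like this in base R. The two small data frames are hypothetical slices, with numbers taken from the sample output in this report, and the exact spelling of the county_name values ("Athens County", etc.) is an assumption.

```r
# Hypothetical slice of the contribution totals.
contrib <- data.frame(
  contbr_city  = c("ATHENS", "Ashland", "COLUMBUS"),
  total_amount = c(83000.26, 25550.41, 2226563.1),
  stringsAsFactors = FALSE
)

# Hypothetical slice of the vote data (name format is assumed).
votes <- data.frame(
  county_name = c("Athens County", "Ashland County", "Allen County"),
  total_votes = c(27941, 24074, 44636),
  stringsAsFactors = FALSE
)

# Lower-case both keys and strip the trailing " county" from the vote data.
contrib$key <- tolower(trimws(contrib$contbr_city))
votes$key   <- sub(" county$", "", tolower(trimws(votes$county_name)))

# Inner join: rows without a match on either side are dropped.
merged <- merge(contrib, votes, by = "key")
print(merged[, c("key", "total_amount", "total_votes")])
```

Because cities and counties are different geographic units, an inner join on the name keeps only places whose city name coincides with a county name. A city such as Columbus, which does not share its name with an Ohio county, would drop out, and that is worth keeping in mind when reading the correlations.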
## Source: local data frame [2,475 x 4]
## Groups: contbr_city [?]
##
## contbr_city party count total_amount
## <chr> <chr> <int> <dbl>
## 1 batavia Republican 1 500.00
## 2 45320 Republican 1 80.00
## 3 aberdeen Democratic 5 900.00
## 4 aberdeen Republican 2 44.00
## 5 ada Democratic 97 4272.00
## 6 ada Other 43 3682.88
## 7 ada Republican 18 1458.00
## 8 adams county Republican 1 80.00
## 9 addyston Democratic 11 392.55
## 10 addyston Republican 3 190.00
## # ... with 2,465 more rows
## contbr_city Democratic Other Republican
## 1 batavia 0.00 0.00 500
## 2 45320 0.00 0.00 80
## 3 aberdeen 900.00 0.00 44
## 4 ada 4272.00 3682.88 1458
## 5 adams county 0.00 0.00 80
## 6 addyston 392.55 0.00 190
## [1] "New dataset of donation amount and votes"
## contbr_city amount_D amount_Other amount_R votes_D votes_R total_votes
## 1 allen 0.00 0.00 80.00 12815 29858 44636
## 2 ashland 3927.04 657.00 20966.37 5659 17169 24074
## 3 ashtabula 4441.30 4928.40 14448.85 15191 22755 39809
## 4 athens 51808.95 18469.83 12721.48 15552 10816 27941
## 5 belmont 1759.50 0.00 39326.00 8652 20729 30537
## 6 butler 768.00 1028.60 1184.60 56700 104441 168422
## total_amount
## 1 80.00
## 2 25550.41
## 3 23818.55
## 4 83000.26
## 5 41085.50
## 6 2981.20
A minor finding is that the correlation between donation amount and votes is stronger for Republican supporters (correlation coefficient: 0.4).
I tried to exclude outliers by limiting the x-axis and y-axis in order to focus on the bulk of the data below. The slope for Plot 1-2 (Republican party data) is slightly steeper than for Plot 1-3 (Democratic party).
Based on the line chart below, the wave of donations starts in late July 2015. There might be some interesting insights to explore here.
My intuition was that there would be more donations at weekends.
Surprisingly, there are more donations on weekdays. My guess is that this reflects lifestyle differences between Asia and the United States.
I noticed that the positive relationship between the contributed amount and the number of voters is not strong. The relationship appears weak, which goes against my original assumption.
Looking at the relationship between the contributed amount and the total voters by party, Republican supporters show a stronger correlation than Democratic supporters.
The correlation coefficients between contributed amount and voter numbers are:
1) Republican party: 0.401
2) Democratic party: 0.184
The Republican coefficient is higher than the coefficient between total contributed amount and total voter numbers in Ohio (0.307), while the Democratic coefficient is lower.
The total contributed amount and the Republican contributed amount are highly correlated (correlation coefficient 0.934) because contributions from Republican supporters account for roughly 60% of the total.
However, this is not a proper pair to compare because the two variables are not independent: one is a component of the other.
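For reference, the Pearson coefficient used here is the covariance of the two variables divided by the product of their standard deviations. The sketch below computes it with base R's cor() on only the six rows printed in the merged output above, so the values it returns will not match the coefficients reported for the full dataset.

```r
# The six rows shown in the merged output above (not the full dataset).
d <- data.frame(
  amount_D = c(0.00, 3927.04, 4441.30, 51808.95, 1759.50, 768.00),
  amount_R = c(80.00, 20966.37, 14448.85, 12721.48, 39326.00, 1184.60),
  votes_D  = c(12815, 5659, 15191, 15552, 8652, 56700),
  votes_R  = c(29858, 17169, 22755, 10816, 20729, 104441)
)

# Pearson correlation between donated amount and votes, per party.
r_rep <- cor(d$amount_R, d$votes_R, method = "pearson")
r_dem <- cor(d$amount_D, d$votes_D, method = "pearson")

# The same coefficient from its definition: cov(x, y) / (sd(x) * sd(y)).
r_rep_manual <- cov(d$amount_R, d$votes_R) / (sd(d$amount_R) * sd(d$votes_R))

print(c(republican = r_rep, democratic = r_dem))
```

On the full merged data frame, cor.test() could be used instead of cor() to obtain a confidence interval alongside the coefficient.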
To find ways of increasing donations, I also wanted to know which occupations contributed the most, both in number of donations and in average amount per donation.
Based on the graph below, "Attorney" and "Homemaker" have the highest average donation amounts. This suggests that a party platform appealing to these two occupations might help raise more funds.
## [1] "Top10 occupation:"
## [1] RETIRED NOT EMPLOYED INFORMATION REQUESTED
## [4] ATTORNEY HOMEMAKER PHYSICIAN
## [7] TEACHER PROFESSOR ENGINEER
## [10] SALES
## 6554 Levels: CERTIFIED REGISTERED NURSE ANESTHETIS - ... ZOOKEEPER
During the election period, candidates need to travel from city to city to win support. A heat map gives a clear view of which cities are most supportive in terms of donation amount. I enriched the original dataset with zip code data so that I could plot a heat map.
## cand_nm contbr_city contbr_zip contb_receipt_amt
## 1 Cruz, Rafael Edward 'Ted' LEESBURG 451359416 25.00
## 2 Cruz, Rafael Edward 'Ted' MINERVA 446579402 25.00
## 3 Clinton, Hillary Rodham COLUMBUS 432141210 40.00
## 4 Sanders, Bernard COLUMBUS 432022420 50.00
## 5 Clinton, Hillary Rodham LEBANON 450365038 57.31
## 6 Sanders, Bernard CINCINNATI 45249 2.50
## party
## 1 Republican
## 2 Republican
## 3 Democratic
## 4 Other
## 5 Democratic
## 6 Other
## Potential error data: 83
Looking at donation amount over time in the top 10 cities, I found an interesting pattern: donations to the Democratic party were concentrated in 2016, whereas donations to the Republican party were concentrated in 2015. Cincinnati and Cleveland, two of the cities with the largest donation amounts, show the most obvious trend.
I noticed that the major cities account for most of the contributed amount. The map makes it clear that there are a few hot spots in Ohio.
After separating the contributed amount by party, the plot shows that more funding went to the Republican party, which is reflected in the voting result: the Republican party won Ohio in the end.
No. I tried to build a linear regression model with numeric and categorical data, but it failed, and it seems to require more complex statistical libraries.
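The report does not say why the model failed, so the following is only a sketch of one possible starting point: base R's lm() accepts a factor predictor directly and expands it into dummy variables, with no additional library. The data frame `d` is made up for illustration, and modelling log10 of the amount is an assumption that follows the log transformation used earlier.

```r
# Hypothetical data: donation amount plus one categorical predictor.
d <- data.frame(
  contb_receipt_amt = c(25, 50, 40, 100, 250, 500, 35, 80, 1000, 20),
  party = factor(c("Democratic", "Democratic", "Other", "Republican",
                   "Republican", "Republican", "Other", "Democratic",
                   "Republican", "Other"))
)

# lm() turns the factor into dummy variables; the first level
# ("Democratic") becomes the baseline captured by the intercept.
fit <- lm(log10(contb_receipt_amt) ~ party, data = d)

# The intercept is the mean log10 amount of the baseline level; the other
# coefficients are differences from that baseline.
print(coef(fit))
```

Only positive amounts can be used with this formula, so refunds would need to be filtered out first.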
The correlation between donation amount and votes is not as strong as I expected. In the earlier exploration, I found that many cities are clustered together, so I tried to exclude outliers by limiting the x-axis and y-axis to focus on the bulk of the data. The slope for Plot 1-2 (Republican party data) is slightly steeper than for Plot 1-3 (Democratic party).
A minor finding is that the correlation between donation amount and votes is stronger for Republican supporters (correlation coefficient: 0.4).
The correlation coefficients between contributed amount and voter numbers for the 2 parties are:
1) Republican party: 0.401 (Plot 1-2)
2) Democratic party: 0.184 (Plot 1-3)
The Republican coefficient is higher than the coefficient between total contributed amount and total voter numbers in Ohio (0.307, Plot 1-1), while the Democratic coefficient is lower.
The earlier exploration of contributed amount by weekday shows lower contributed amounts at weekends. This says something about weekday versus weekend habits, but it might be expected, so I went on to look at contribution amounts on a broader time scale.
Looking at donation amount over time in the top 10 cities, I found an interesting pattern: donations to the Democratic party were concentrated in 2016, whereas donations to the Republican party were concentrated in 2015. Cincinnati and Cleveland, two of the cities with the largest donation amounts, show the most obvious trend.
My guess is that the different distributions might be caused by the timing of each party's platform announcements or by the campaign tour plans. I would expect donations to rise whenever a party's candidate engages with citizens in a city.
The plot shows that the contributed money comes mainly from urban areas such as Columbus, Cleveland, Akron and Cincinnati. This helps candidates identify the cities to target in future fundraising campaigns.
I distinguish funding for the Republican party and the Democratic party by color in Plot 3-1. It shows that the Republican party received more funding in Ohio, and the voting result shows that the Republican party also won the state.
Before starting the analysis, I assumed that the contributed amount would be a strong indicator of the election result. After analyzing the relationship between the election result in Ohio and the contributed amounts in Ohio, I found that the correlation coefficient between the two is lower than I expected, so there is no evidence of a strong correlation between contributed amount and voter numbers.
However, this analysis covers only one state. As a next step, I would suggest analyzing the data for all U.S. states to see whether there is a strong relationship between these 2 factors.
During the analysis, I struggled with the more than 6,000 distinct occupations, which I believe still hold insights to be uncovered. It would help if donors could choose from a set of default options when making a donation, such as "Retired", "Public Servant", "Military" or "Teacher". I could then cross-check against each party's platform to see whether the platform has any impact on donation amounts by occupation.
Problem: I kept getting "Error: Discrete value supplied to continuous scale".
After merging in the gender predicted by the gender library, I got an error message when I tried to plot a bar chart of gender with ggplot2. When I used ggplot(oh_data, aes(x = gender)), I got "Error: Discrete value supplied to continuous scale".
One way to fix this error is to plot the discrete variable on a discrete scale. There are a few ways to do this:
1) Add a scale_x_discrete() layer. –> This works!
2) Use as.factor(), i.e. ggplot(oh_data, aes(x=as.factor(gender))) + geom_bar() –> This didn't work in my code. I got another error: Error in sort.list(y) : ‘x’ must be atomic for ‘sort.list’ Have you called ‘sort’ on a list?
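A possible explanation, offered only as a guess: the summary output earlier in this report labels the column "gender.gender", which is what R prints when a column is itself a data frame nested inside the parent data frame. Such a column is not an atomic vector, which would be consistent with the "must be atomic" message. The sketch below reproduces that structure on a tiny made-up data frame and flattens it.

```r
# Reproduce the suspected structure: a data frame stored inside a column.
df <- data.frame(id = 1:3)
df$gender <- data.frame(gender = c("female", "male", "male"),
                        stringsAsFactors = FALSE)

# The nested column is a data frame (a list), not an atomic vector.
nested_is_atomic <- is.atomic(df$gender)
print(nested_is_atomic)

# Flatten it: pull out the inner vector and store it as a plain factor.
df$gender <- factor(as.character(df$gender$gender))
print(table(df$gender))

# With an atomic factor column, a call such as
#   ggplot(df, aes(x = gender)) + geom_bar()
# is expected to treat the variable as discrete without extra layers.
```

If this guess is right, the equivalent fix on the real data would be to replace oh_data$gender with its inner gender vector before plotting.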